Lecture 3

This lecture, taught by Prof. Cathy Yi-Hsuan Chen, focuses on web languages (HTML, XML, and JSON), the task of parsing, and RSS news feeds.

The code for this lecture can be found on GitHub.

Here is a video from Coursera

Outline

Web language
- HTML
- XML
- JSON
Python Requests module
Parsing
- Parsing XML
- Parsing JSON
RSS news feeds
- Financial Times news feeds
- Wall Street Journal news feeds
Coursework
Additional Resources

Web language

HTML (Hypertext Markup Language): the standard markup language for documents designed to be displayed in a web browser. HTML was designed to display data, with a focus on how the data looks.

XML (eXtensible Markup Language): XML was designed to carry data, with a focus on what the data is. Unlike HTML tags, XML tags are not predefined.

JSON (JavaScript Object Notation): JSON is a syntax for storing and exchanging data between a browser and a server

HTML

Describes the structure of a Web page

Consists of a series of elements represented by tags

Elements tell the browser how to display the content

HTML tags label pieces of content such as "heading", "paragraph", "table", and so on

Browsers do not display the HTML tags, but use them to render the content of the page

<!DOCTYPE html>                 <!-- declares the document type -->
<html>                          <!-- root element of an HTML page -->
<head>                          <!-- contains meta information about the document -->
<title>Page Title</title>       <!-- specifies the title of the document -->
</head>
<body>                          <!-- contains the visible page content -->

<h1>My First Heading</h1>       <!-- defines a large heading -->
<p>My first paragraph.</p>      <!-- defines a paragraph -->

</body>
</html>
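
To make the role of tags concrete, here is a minimal sketch that reads the sample page with Python's standard `html.parser` module and collects the text found under each tag. The class name `TextByTag` and the variable `PAGE` are illustrative names, not part of the lecture code.

```python
from html.parser import HTMLParser

# The sample page from the example above, held as a string.
PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""


class TextByTag(HTMLParser):
    """Collect the non-empty text found inside each element, keyed by tag name."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.found = {}   # tag name -> list of text snippets

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open tag if it matches.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.found.setdefault(self.stack[-1], []).append(text)


parser = TextByTag()
parser.feed(PAGE)
print(parser.found)
```

The browser does something similar when it renders a page: it uses the tags to decide what each piece of text is, and shows only the text.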

XML

XML language has no predefined tags

Tags are "invented" by the author of the XML document

Author must define both the tags and the document structure

<note>
  <date>2015-09-01</date>
  <hour>08:30</hour>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don't forget me this weekend!</body>
</note>

<employees>
  <employee>
    <firstName>John</firstName> <lastName>Doe</lastName>
  </employee>
  <employee>
    <firstName>Anna</firstName> <lastName>Smith</lastName>
  </employee>
  <employee>
    <firstName>Peter</firstName> <lastName>Jones</lastName>
  </employee>
</employees>
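
Because the author chooses the tags, a program that reads the document has to know those tag names in advance. The following minimal sketch reads the employees document with `xml.etree.ElementTree` from the Python standard library. This module is chosen here for brevity and is an assumption of the sketch, not necessarily what the lecture code uses.

```python
import xml.etree.ElementTree as ET

# The employees document from the example above, held as a string.
EMPLOYEES_XML = """<employees>
  <employee>
    <firstName>John</firstName> <lastName>Doe</lastName>
  </employee>
  <employee>
    <firstName>Anna</firstName> <lastName>Smith</lastName>
  </employee>
  <employee>
    <firstName>Peter</firstName> <lastName>Jones</lastName>
  </employee>
</employees>"""

root = ET.fromstring(EMPLOYEES_XML)

# The tag names 'employee', 'firstName' and 'lastName' come from the document itself.
names = [
    (employee.findtext("firstName"), employee.findtext("lastName"))
    for employee in root.findall("employee")
]

print(root.tag)
print(names)
```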

JSON

JSON is a syntax for storing and exchanging data

JSON is text, written with JavaScript object notation

JSON is a lightweight, "self-describing" data-interchange format

We can also convert any JSON received from the server into JavaScript objects, and work with the data as JavaScript objects

JSON Syntax Rules

Data is in key/value pairs

Data is separated by commas

Curly braces hold objects

Square brackets hold arrays

{"employees":[
  { "firstName":"John", "lastName":"Doe" },
  { "firstName":"Anna", "lastName":"Smith" },
  { "firstName":"Peter", "lastName":"Jones" }
]}
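
The same conversion works in Python. As a minimal sketch, the standard `json` module turns the text above into Python objects: curly braces become dictionaries and square brackets become lists. The variable names are illustrative.

```python
import json

# The employees example from above, held as a JSON string.
EMPLOYEES_JSON = """{"employees":[
  { "firstName":"John", "lastName":"Doe" },
  { "firstName":"Anna", "lastName":"Smith" },
  { "firstName":"Peter", "lastName":"Jones" }
]}"""

data = json.loads(EMPLOYEES_JSON)   # JSON object -> dict
employees = data["employees"]       # JSON array  -> list
first_names = [person["firstName"] for person in employees]

print(type(data).__name__, type(employees).__name__)
print(first_names)

# json.dumps goes the other way: Python objects -> JSON text.
round_trip = json.dumps(data)
```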

Python Requests Module

Make a request to a web page, and print the response text

import requests

# Send a GET request to the specified URL
x = requests.get('https://w3schools.com/python/demopage.htm')
# Print the body of the response as text
print(x.text)

Parsing

Parsing is the process of analyzing a string of symbols. The term parsing comes from Latin pars (orationis), meaning part (of speech). Within computational linguistics the term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information.

Parser

A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure, giving a structural representation of the input.
In the case of data languages, a parser is often found as the file reading facility of a program, such as reading in HTML or XML text; these examples are markup languages.
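
To see what an abstract syntax tree looks like, one option is to use Python's own parser, exposed through the standard `ast` module. This is only an illustration of the general idea and is not part of the lecture code. The helper `describe` is a made-up name.

```python
import ast

source = "1 + 2 * 3"

# Parse the text as a single expression; the result is a tree of nodes.
tree = ast.parse(source, mode="eval")
root = tree.body   # the node for the outermost operation


def describe(node, depth=0):
    """Print one line per node, indented by its depth in the tree."""
    label = type(node).__name__
    if isinstance(node, ast.Constant):
        label += f"({node.value!r})"
    print("  " * depth + label)
    for child in ast.iter_child_nodes(node):
        describe(child, depth + 1)


describe(tree)
```

The multiplication ends up nested inside the addition, so the tree records operator precedence that is only implicit in the flat input string.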

Parsing XML

import requests
import xml.dom.minidom   # module for XML parser

# download the daily Treasury bill rates as XML and parse them into a DOM
response = requests.get(
    "https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Datasets/daily_treas_bill_rates.xml")
content_1 = response.content
dataDOM_1 = xml.dom.minidom.parseString(content_1)

# download a Google News RSS feed (RSS is XML) and parse it the same way
response = requests.get(
    "https://news.google.com/news/rss/headlines/section/q/finance%20news/finance%20news?ned=us&hl=en")
content_2 = response.content
dataDOM_2 = xml.dom.minidom.parseString(content_2)
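
Once a DOM object exists, elements can be looked up by tag name. The sketch below runs offline on the small note document from the XML section, since the tag names used by the remote feeds are not shown here. The helper `text_of` is a hypothetical convenience function, not part of `xml.dom.minidom`.

```python
import xml.dom.minidom

# The note document from the XML section, held as a string.
NOTE_XML = """<note>
  <date>2015-09-01</date>
  <hour>08:30</hour>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don't forget me this weekend!</body>
</note>"""

dom = xml.dom.minidom.parseString(NOTE_XML)


def text_of(document, tag):
    """Return the text of the first element named `tag`, or None if absent."""
    nodes = document.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.data


root_name = dom.documentElement.tagName
recipient = text_of(dom, "to")
sender = text_of(dom, "from")
missing = text_of(dom, "subject")   # this tag does not exist in the note

print(root_name, recipient, sender, missing)
```

The same `getElementsByTagName` call should work on a DOM built from a downloaded feed, once the tag names of that feed are known.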

Parsing JSON

import requests
import json
import pandas as pd

url = 'http://data.thecrix.de/data/crix.json'
r = requests.get(url)

content = r.content

# json.loads : parse a JSON string
js_content = json.loads(content)
for item in js_content:
    print(item)

# make a data frame
data_raw = pd.DataFrame(js_content)
data_raw.set_index(keys='date', inplace=True)

# make a time-series plot
data_raw.plot()
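
The steps above can be tried without a network connection. The sketch below assumes the JSON is a list of records that each carry a `date` key, as the lecture code implies. The `value` key and all numbers are made-up placeholders, not CRIX data. It also converts the dates to timestamps, which is optional but usually gives a better time axis when plotting.

```python
import json

import pandas as pd

# Placeholder data: only the 'date' key is taken from the lecture code.
raw = (
    '[{"date": "2020-01-01", "value": 1.0},'
    ' {"date": "2020-01-02", "value": 2.0},'
    ' {"date": "2020-01-03", "value": 4.0}]'
)

records = json.loads(raw)        # list of dicts
frame = pd.DataFrame(records)    # one row per record

# Turn the date strings into timestamps and use them as the index.
frame["date"] = pd.to_datetime(frame["date"])
frame = frame.set_index("date")

print(frame)
```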

RSS news feed

RSS (originally RDF Site Summary) is a web feed which allows users and applications to access updates to websites in a standardized, computer-readable format. These feeds can, for example, allow a user to keep track of many different websites in a single news aggregator. The news aggregator will automatically check the RSS feed for new content, allowing the list to be automatically passed from website to website or from website to user.

Note: a web feed (or news feed) is a data format used for providing users with frequently updated content.
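
An RSS 2.0 feed is an XML document: a `channel` element describes the feed, and each `item` inside it is one entry with fields such as `title`, `link`, and `description`. The minimal sketch below reads a hand-written feed with the standard library. The feed content and the example.com links are placeholders, and `xml.etree.ElementTree` is used here only to show the structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: titles, links and descriptions are placeholders.
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>https://example.com/</link>
    <description>Placeholder channel</description>
    <item>
      <title>First headline</title>
      <link>https://example.com/1</link>
      <description>Summary of the first item</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/2</link>
      <description>Summary of the second item</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(RSS_SAMPLE)
channel = root.find("channel")

# Collect one dictionary per <item> element.
items = [
    {
        "title": item.findtext("title"),
        "link": item.findtext("link"),
        "description": item.findtext("description"),
    }
    for item in channel.findall("item")
]

for index, entry in enumerate(items):
    print("{0}.{1}".format(index, entry["title"]))
```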

Financial Times news feed

Please visit the Financial Times RSS feed page and click the Business Education RSS feed.

import feedparser   # parser for RSS feeds

# retrieve the RSS feed
content = feedparser.parse("https://www.ft.com/?edition=international&format=rss")

# list all titles
print("\nTitles-------------------------\n")
for index, item in enumerate(content.entries):
    print("{0}.{1}".format(index, item["title"]))

# list all descriptions
print("\r\nDescriptions-------------------\r\n")
for index, item in enumerate(content.entries):
    print("{0}.{1}\n".format(index, item["description"]))

Wall Street Journal news feed

Please visit the Wall Street Journal RSS feed page.

Choose the news category of interest, for instance U.S. Business.

import feedparser   # parser for RSS feeds
import pandas as pd

# retrieve the RSS feed for U.S. business news
content = feedparser.parse("https://feeds.a.dj.com/rss/WSJcomUSBusiness.xml")

# list all titles
for index, item in enumerate(content.entries):
    print("{0}.{1}".format(index, item["title"]))
# collect the titles in a data frame, built in one step from a list
dfData_title = pd.DataFrame({'title': [item["title"] for item in content.entries]})

# list all descriptions
print("\r\nDescriptions-------------------\r\n")
for index, item in enumerate(content.entries):
    print("{0}.{1}\n".format(index, item["description"]))
# collect the descriptions in a data frame
dfData_des = pd.DataFrame({'description': [item["description"] for item in content.entries]})

Coursework

Please find another news feed and implement a parser for it.

You can consider the BBC news feed, other news categories in the Wall Street Journal RSS feed, or any other feed from around the world.
